In [1]:
%matplotlib inline
from utils import *

BLUF - Bottom Line Up Front

Clustering using CyberActive graph embedding was mostly successful.

For each graph type - YouTube, GMail, VGame, Attack, Download, and CNN - the top-ranked members of each cluster consist of a single, uniform type of subgraph.

StreamSpot Dataset

Tab-separated file with one edge on each line in the following format:

source-id source-type destination-id destination-type edge-type graph-id

Graph IDs correspond to scenarios as follows:

  • YouTube (graph IDs 0 - 99)
  • GMail (graph IDs 100 - 199)
  • VGame (graph IDs 200 - 299)
  • Drive-by-download attack (graph IDs 300 - 399)
  • Download (graph IDs 400 - 499)
  • CNN (graph IDs 500 - 599)

CyberActive Graph Embedding

As we saw in the analysis, there are several different types of graphs. To identify similarity in dynamic graphs it is important to account for both

  • spatial closeness of different nodes
  • temporal closeness of different nodes

The embeddings of two nodes that have a large overlap in their neighbor nodes should be similar. Closeness in time should likewise pull the embeddings of two nodes closer together.

Note that we are developing an approach to create graph embeddings for dynamic graphs. The building blocks are subgraphs created from the nodes and edges that fall within a fairly small, predefined time slice.
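The chunking idea can be sketched as follows. The chunk size of 100 rows matches the processing steps used later; the helper name `chunk_edges` is illustrative, not a function from `utils.py`:

```python
# A toy sketch of the chunking step: split an ordered edge list into
# fixed-size chunks, each of which becomes one subgraph.
def chunk_edges(edges, chunk_size=100):
    """Split a list of edge rows into consecutive chunks of chunk_size rows."""
    return [edges[i:i + chunk_size] for i in range(0, len(edges), chunk_size)]

# 250 placeholder edge rows -> chunks of 100, 100, and 50 rows
chunks = chunk_edges(list(range(250)))
print([len(c) for c in chunks])  # [100, 100, 50]
```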

First, we convert the csv data into a sequence of node_edge_node triples.

For example, the following table

source_id source_type destination_id destination_type edge_type graph_id
4 b 77 c u 0
4 a 77 f v 0
4 a 0 d t 0

is encoded as a sentence of three words:

"b_u_c   a_v_f   a_t_d"

The node and edge types were mapped to single letters by the authors according to the map below. Using the map, these three words translate as:

b_u_c = thread open file

a_v_f = process read stdin

a_t_d = process mmap2 MAP_ANONYMOUS

In [2]:
print_dict(node_edge_map, 6)

a:  process     b:  thread     c:  file     d:  MAP_ANONYMOUS     e:  NA     f:  stdin     

g:  stdout     h:  stderr     i:  accept     j:  access     k:  bind     l:  chmod     

m:  clone     n:  close     o:  connect     p:  execve     q:  fstat     r:  ftruncate     

s:  listen     t:  mmap2     u:  open     v:  read     w:  recv     x:  recvfrom     

y:  recvmsg     z:  send     A:  sendmsg     B:  sendto     C:  stat     D:  truncate     

E:  unlink     F:  waitpid     G:  write     H:  writev     

We can think of the triples as words, and sequences of such triples as sentences.

We use the gensim package, specifically its Doc2Vec functionality, to convert each word, and ultimately the whole sentence, into an embedding of a predefined size.

Here are the processing steps implemented in the function process_data in utils.py:

For each graph type (YouTube, GMail, VGame, Attack, Download, and CNN)

  1. Read data for one graph_id
  2. Chunk the graph into subgraphs of 100 rows
  3. Convert the csv data to sentences of triples
  4. Train a Doc2Vec model using gensim package; convert each sentence into a vector of predefined size
  5. Perform clustering of the sentence-vectors using KMeans from scikit-learn
  6. Visualize the subgraphs with the highest silhouette scores

The silhouette score measures the quality of a clustering. To quote this Wikipedia article: The silhouette value is a measure of how similar an object is to its own cluster compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
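Steps 5 and 6 can be sketched with scikit-learn alone; here random vectors stand in for the Doc2Vec sentence embeddings, and all names are illustrative:

```python
# Cluster stand-in sentence vectors with KMeans and rank each cluster's
# members by silhouette score, as is done to pick subgraphs for display.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 16)) * 10           # three separated cluster centers
X = np.vstack([c + rng.normal(size=(20, 16)) for c in centers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
scores = silhouette_samples(X, km.labels_)        # one score per vector, in [-1, 1]

for k in range(3):
    idx = np.where(km.labels_ == k)[0]
    top = idx[np.argsort(scores[idx])[::-1][:5]]  # best-matched members first
    print(f"cluster {k}: top members {top}")
```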

In [17]:
n_components = 3
number_graphs = 24
display_config = (8,3)
display(Markdown('# For one graph from each graph type: YouTube, GMail, VGame, Attack, Download, and CNN'))
display(Markdown('### ' + 'Cluster the subgraphs into ' + str(n_components) + ' clusters' ))
display(Markdown('### ' + 'For each cluster display the top ' + str(number_graphs) + ' subgraphs' ))

For one graph from each graph type: YouTube, GMail, VGame, Attack, Download, and CNN

Cluster the subgraphs into 3 clusters

For each cluster display the top 24 subgraphs

One Graph From YouTube

In [18]:
%%time
graph_id = 51
graph_type = 'YouTube'
test_data, sample_silhouette_values, km, all_graphs, node_colors = process_data(graph_id, n_components, epochs = 100)
Wall time: 49.1 s
In [19]:
for i_comp in range(n_components):
    display(Markdown('### ' + graph_type + ' Cluster  ' + str(i_comp)))
    ith_indices = np.where(km.labels_ == i_comp)[0]
    ith_cluster_silhouette_values = sample_silhouette_values[ith_indices]
    sorted_index_array = np.argsort(ith_cluster_silhouette_values)[::-1]
    sub_graphs = [all_graphs[i] for i in ith_indices]
    sub_colors  = [node_colors[i] for i in ith_indices]
    visualize_graphs(sub_graphs, sub_colors, sorted_index_array[0:number_graphs], display_config)

YouTube Cluster 0

YouTube Cluster 1

YouTube Cluster 2

One Graph From GMail

In [20]:
%%time
graph_id = 180
graph_type = 'GMail'
test_data, sample_silhouette_values, km, all_graphs, node_colors = process_data(graph_id, n_components, epochs = 100)
Wall time: 18.8 s
In [21]:
for i_comp in range(n_components):
    display(Markdown('### ' + graph_type + ' Cluster  ' + str(i_comp)))
    ith_indices = np.where(km.labels_ == i_comp)[0]
    ith_cluster_silhouette_values = sample_silhouette_values[ith_indices]
    sorted_index_array = np.argsort(ith_cluster_silhouette_values)[::-1]
    sub_graphs = [all_graphs[i] for i in ith_indices]
    sub_colors  = [node_colors[i] for i in ith_indices]
    visualize_graphs(sub_graphs, sub_colors, sorted_index_array[0:number_graphs], display_config)

GMail Cluster 0

GMail Cluster 1

GMail Cluster 2

One Graph From VGame

In [22]:
%%time
graph_id = 251
graph_type = 'VGame'
test_data, sample_silhouette_values, km, all_graphs, node_colors = process_data(graph_id, n_components, epochs = 100)
Wall time: 56.2 s
In [23]:
for i_comp in range(n_components):
    display(Markdown('### ' + graph_type + ' Cluster  ' + str(i_comp)))
    ith_indices = np.where(km.labels_ == i_comp)[0]
    ith_cluster_silhouette_values = sample_silhouette_values[ith_indices]
    sorted_index_array = np.argsort(ith_cluster_silhouette_values)[::-1]
    sub_graphs = [all_graphs[i] for i in ith_indices]
    sub_colors  = [node_colors[i] for i in ith_indices]
    visualize_graphs(sub_graphs, sub_colors, sorted_index_array[0:number_graphs], display_config)

VGame Cluster 0

VGame Cluster 1

VGame Cluster 2

One Graph From Attack

In [24]:
%%time
graph_id = 301
graph_type = 'Attack'
test_data, sample_silhouette_values, km, all_graphs, node_colors = process_data(graph_id, n_components, epochs = 10)
Wall time: 16.1 s
In [25]:
for i_comp in range(n_components):
    display(Markdown('### ' + graph_type + ' Cluster  ' + str(i_comp)))
    ith_indices = np.where(km.labels_ == i_comp)[0]
    ith_cluster_silhouette_values = sample_silhouette_values[ith_indices]
    sorted_index_array = np.argsort(ith_cluster_silhouette_values)[::-1]
    sub_graphs = [all_graphs[i] for i in ith_indices]
    sub_colors  = [node_colors[i] for i in ith_indices]
    visualize_graphs(sub_graphs, sub_colors, sorted_index_array[0:number_graphs], display_config)

Attack Cluster 0

Attack Cluster 1

Attack Cluster 2

One Graph From Download

In [26]:
%%time
graph_id = 450
graph_type = 'Download'
test_data, sample_silhouette_values, km, all_graphs, node_colors = process_data(graph_id, n_components, epochs = 100)
Wall time: 2min 8s
In [27]:
for i_comp in range(n_components):
    display(Markdown('### ' + graph_type + ' Cluster  ' + str(i_comp)))
    ith_indices = np.where(km.labels_ == i_comp)[0]
    ith_cluster_silhouette_values = sample_silhouette_values[ith_indices]
    sorted_index_array = np.argsort(ith_cluster_silhouette_values)[::-1]
    sub_graphs = [all_graphs[i] for i in ith_indices]
    sub_colors  = [node_colors[i] for i in ith_indices]
    visualize_graphs(sub_graphs, sub_colors, sorted_index_array[0:number_graphs], display_config)

Download Cluster 0

Download Cluster 1

Download Cluster 2

One Graph From CNN

In [28]:
%%time
graph_id = 551
graph_type = 'CNN'
test_data, sample_silhouette_values, km, all_graphs, node_colors = process_data(graph_id, n_components, epochs = 100)
Wall time: 1min 25s
In [29]:
for i_comp in range(n_components):
    display(Markdown('### ' + graph_type + ' Cluster ' + str(i_comp)))
    ith_indices = np.where(km.labels_ == i_comp)[0]
    ith_cluster_silhouette_values = sample_silhouette_values[ith_indices]
    sorted_index_array = np.argsort(ith_cluster_silhouette_values)[::-1]
    sub_graphs = [all_graphs[i] for i in ith_indices]
    sub_colors  = [node_colors[i] for i in ith_indices]
    visualize_graphs(sub_graphs, sub_colors, sorted_index_array[0:number_graphs], display_config)

CNN Cluster 0

CNN Cluster 1

CNN Cluster 2
